Clueing in DC: An Analysis of DC Crime Data from 2018-2022

DSAN 5200 Final Project

Authors
Affiliation

Brian Kwon

Georgetown University

Powell Sheagren

Dheeraj Oruganty

Published

April 29, 2024

Introduction

Crime is not a laughing matter yet there are still many games, board games even, which use a crime as an inciting incident. For this visual narrative we decided to flip the script and use a game format as a framing device for crime statistics, and if you don’t know which board game we’re talking about, the following paragraphs will clue you in!

In order to do this we will be using FBI crime data from the National Incident-Based Reporting System (NIBRS) specifically on reported offenses in DC. We will use this data to draw insights about crime in DC across years and through the effects of the global pandemic. We will then pivot to reporting on details relating to the alluded to board games’ three part question later on.

Figure 1: Comparing Crime Across Cities in 2019

Code
library(rvest)
library(tidyverse)
library(plotly)

# Parse 2019 crime rate data from wikipedia
url = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate"
page = read_html(url)
tables = html_table(page, fill = TRUE)
crime_data = tables[[1]]

# Preprocess the dataset
colnames(crime_data) = crime_data[2, ]
crime_data = crime_data[-c(1,2), ]
crime_data = crime_data %>% select(1,2,3,4) # Remove unnecessary columns
colnames(crime_data) = c("state", "city", "population", "crime_rate") # Change column names
crime_data$population = as.numeric(gsub(",", "", crime_data$population)) # Change to numeric
crime_data$crime_rate = as.numeric(crime_data$crime_rate) # Change to numeric

# Leave only one city per state by population
crime_data = crime_data %>%
  group_by(state) %>%
  slice(which.max(population))

# Remove footnote number
crime_data$city = gsub("\\d+$", "", crime_data$city)
crime_data$state =  gsub("\\d+$", "", crime_data$state)

# Change some city name manually for merging
crime_data = crime_data %>%
  mutate(city = if_else(city == "Washington, D.C.", "Washington", city)) %>%
  mutate(city = if_else(city == "Louisville Metro", "Louisville", city))

# Get latitude and longitude data
location = read.csv("./data/uscities.csv")
location = location %>% select(1,4,lat,lng)

# Merge two data sets
df = merge(crime_data, location, by = "city")
df = df %>% 
    filter(state == state_name) %>%
    select(-state_name)

# Color palette
colors = c("#F1FAEE", "#A8DADC", "#457B9D", "#1D3557", "#E63946")
# colors = c("#ccdbdc","#9ad1d4","#80ced7","#007ea7","#003249")
# colors = c("#ccdbdc","#edf8b1", "#7fcdbb", "#2c7fb8")
# colors = c("#caf0f8", "#ade8f4", "#90e0ef", "#48cae4", "#00b4d8", "#0096c7", "#0077b6", "#023e8a", "#03045e") 

# Plot bubble map
map = plot_geo(df, lat = ~lat, lon = ~lng) %>%
  add_markers(
    text = ~paste("State: ", state, "<br>City: ", city, "<br>Crime Rate: ", crime_rate, "<br>Population: ", population), 
    size = ~population, 
    color = ~crime_rate,
    colors = colors,
    opacity = 10000,
    marker = list(sizemode = 'area', sizeref = 0.2, line = list(color = 'black', width = 2))) %>%
    colorbar(title = "Crime Rate") %>%
  layout(title = 'Crime Rate Bubble Map for US cities in 2019', geo = list(scope = 'usa'),
         annotations = list(list(x = 0.8, y = 0.55, text = "Washington D.C.", showarrow = TRUE, xanchor = 'left', yanchor = 'bottom',  ax = 30, ay = 30, font = list(size = 12, color = "black")),
                       list(x = 1, y = 0,  text = "Size by population", showarrow = FALSE, xanchor='right', yanchor='auto', xshift=0, yshift=0, font=list(size=12, color="grey"))))
map

Figure 1: This map represents the relative population and crime rates in the U.S. cities. The size of the dots shows the population of the city and the color represents the amount of crime per 100,000 people.

DC is in the interesting situation of being a mix between a state and a city so it tends to have a higher crime rate than states on a per capita level despite being very similar to the 39 cities with the biggest population from each state. In 2019, DC had an average per capita crime rate among its peers and was nowhere near the highest crime rate areas. As we look further into the data over the pandemic years we should keep in mind that DC is not a criminal outlier despite what the comparable statistics and politicians might say.

Figure 2: Crime Proportions in DC From 2018-2022

Code
library(tidyverse)
library(DT)

# Read data files
offense_22 = read.csv("./data/DC-2022/NIBRS_OFFENSE.csv") 
offense_21 = read.csv("./data/DC-2021/NIBRS_OFFENSE.csv") 
offense_20 = read.csv("./data/DC-2020/NIBRS_OFFENSE.csv") 
offense_19 = read.csv("./data/DC-2019/NIBRS_OFFENSE.csv") 
offense_18 = read.csv("./data/DC-2018/NIBRS_OFFENSE.csv") 
offense_code1 = read.csv("./data/DC-2022/NIBRS_OFFENSE_TYPE.csv")
offense_code2 = read.csv("./data/DC-2018/NIBRS_OFFENSE_TYPE.csv")

# Merge with code files for corresponding offense names
offense_22 = merge(offense_22, offense_code1, by = "offense_code")
offense_21 = merge(offense_21, offense_code1, by = "offense_code")
offense_20 = merge(offense_20, offense_code2, by = "OFFENSE_TYPE_ID")
offense_19 = merge(offense_19, offense_code2, by = "OFFENSE_TYPE_ID")
offense_18 = merge(offense_18, offense_code2, by = "OFFENSE_TYPE_ID")

# Calculate the percentage based on the count
offense_22_count = as.data.frame(round(table(offense_22$offense_name)/nrow(offense_22)*100,2))
offense_21_count = as.data.frame(round(table(offense_21$offense_name)/nrow(offense_21)*100,2))
offense_20_count = as.data.frame(round(table(offense_20$OFFENSE_NAME)/nrow(offense_20)*100,2))
offense_19_count = as.data.frame(round(table(offense_19$OFFENSE_NAME)/nrow(offense_19)*100,2))
offense_18_count = as.data.frame(round(table(offense_18$OFFENSE_NAME)/nrow(offense_18)*100,2))

# Merge all years
offense_df = merge(merge(merge(merge(offense_18_count, offense_19_count, by = "Var1", all = TRUE), offense_20_count, by = "Var1", all = TRUE), offense_21_count, by = "Var1", all = TRUE), offense_22_count, by = "Var1", all = TRUE)
colnames(offense_df) = c("Offense Type", "2018", "2019", "2020", "2021", "2022")

# Create datatable
datatable(data = offense_df, caption = "Table", filter = "top")

Figure 2: Data was collected from the 2018-2022 FBI’s National Incident-Based Reporting System. The values are the percentage of a total crime that an individual offense made up.

Moving to an increased time frame, we can see the representation of various crimes over the years, from some that stayed rare if consistent to ones which were sharply affected by events such as the pandemic. For instance, robbery and purse snatching were one of the most likely offenses before the pandemic but decreased during it before resurfacing in 2022. With less people out in public this crime happened much less often. However, simple assault and destruction of property remained two of the most prominent crimes over all of the years. This does not show the full complexity of the offense data though, so next we will expand upon this data and look at the relationships between various offenses.

Figure 3: Relationships Between Offenses Within Incidents

Code
import plotly.graph_objects as go
import numpy as np
import networkx as nx

## Code for this graph generously donated from:
# https://plotly.com/python/network-graphs/

## importing matrix
matrix = np.genfromtxt('./data/adjacency_matrix.csv', delimiter = ",")

## Turning adjacency matrix to graph obkect
G = nx.from_numpy_array(matrix,create_using=nx.DiGraph)

## Using a spiral layout to show centrality
pos = nx.shell_layout(G)

## Adding position based on the layout
for i in range(0,42):
    for g in range(0,42):
        G.nodes[i]['pos'] = pos[i]
        G.nodes[g]['pos'] = pos[g]

## Adding edges together
edge_x = []
edge_y = []
for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)

## arranging them into lines
edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.5, color='#888'),
    hoverinfo='none',
    mode='lines')

## adding nodes to graph
node_x = []
node_y = []
for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)

## assembly again
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale = ["#1D3557", "#457B9D", "#A8DADC", "#F1FAEE"],
        reversescale=True,
        color=[],
        size=10,
        colorbar=dict(
            thickness=15,
            title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line_width=2))

## Offenses for tooltip
offenses_list = [ "Destruction/Damage/Vandalism of Property",    "Theft From Motor Vehicle"  ,                 
  "Robbery"                      ,               "Simple Assault"           ,                  
  "Intimidation"                 ,               "All Other Larceny"         ,                 
  "Motor Vehicle Theft"           ,              "Drug Equipment Violations" ,                 
 "Drug/Narcotic Violations"     ,               "Weapon Law Violations"     ,                 
 "Stolen Property Offenses"     ,               "Aggravated Assault"        ,                 
 "Purse-snatching"              ,               "Extortion/Blackmail"         ,               
 "Theft From Building"          ,               "Fondling"                  ,                 
 "Counterfeiting/Forgery"       ,               "Theft of Motor Vehicle Parts or Accessories",
 "Credit Card/Automated Teller Machine Fraud" , "Impersonation"      ,                        
 "Pocket-picking"                     ,         "Kidnapping/Abduction"     ,                  
 "False Pretenses/Swindle/Confidence Game"   ,  "Burglary/Breaking & Entering"       ,        
 "Rape"                      ,                  "Murder and Nonnegligent Manslaughter"   ,    
 "Theft From Coin-Operated Machine or Device" , "Animal Cruelty"          ,                   
 "Shoplifting"             ,                    "Hacking/Computer Invasion"      ,            
 "Identity Theft"         ,                     "Wire Fraud"                ,                 
 "Arson"                 ,                      "Betting/Wagering"         ,                  
 "Welfare Fraud"        ,                       "Pornography/Obscene Material"     ,          
 "Bribery"           ,                          "Purchasing Prostitution"       ,             
 "Prostitution"                 ,               "Sodomy"                 ,                    
 "Sexual Assault With An Object", "Other"]

# getting tooltip
node_adjacencies = []
node_text = []
for node, adjacencies in enumerate(G.adjacency()):
    node_adjacencies.append(len(adjacencies[1]))
    node_text.append("Offense Type: " + offenses_list[node] + ' | # of connections: '+str(len(adjacencies[1])))

node_trace.marker.color = node_adjacencies
node_trace.text = node_text

## Plotting the figure
fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Amount of times an Offense was Listed with other Offenses',
                titlefont_size=16,
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                annotations=[ dict(
                    text="",
                    showarrow=False,
                    xref="paper", yref="paper",
                    x=0.005, y=-0.002 ) ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )

fig.update_traces(marker=dict(size=node_adjacencies));
fig.show()

Figure 3: The network diagram shows data from all years where one incident involved multiple offenses. The lines represent offenses that were listed in the same incident and the size of the nodes show the amount of other offenses each offense was listed with.

It’s important to note that these offenses are not siloed and in fact interact in different ways across incidents. The NIBRS data set works on an incident level and above you can see the offenses which were most connected to other offenses. As it can be seen, there are a lot of connections with offenses like simple assault and theft chargers which had the most combinations with the others. In any given incident it’s never going to be as clean cut and this visualization shows that level of interconnectivity.

So far we have been looking at general information about crime statistics in DC; we will now pivot to talking about specific aspects of the offenses recorded but we ask you to look at this as if it were a problem from a, as you may have guessed, a game of Clue! “A crime has happened to Mr. Mo Boddy in DC, and it is your job to figure out what happened. You must figure out who committed the crime, where he/she committed it, and with what weapon.” As we continue we will assume that any of the offenses happened to him but feel free to pick an offense or offense category and follow along with that offense in mind.

Figure 4: Offense by Relationship to Victim Heatmap

Code
library(tidyverse)
library(plotly)
library(heatmaply)

## 2018
offense_data_2018 <- read.csv("data/DC-2018/NIBRS_OFFENSE.csv") %>% mutate(year = 2018)
offense_2018 <- read.csv("data/DC-2018/NIBRS_OFFENSE_TYPE.csv")
victim_data_2018 <- read.csv("data/DC-2018/NIBRS_VICTIM.csv") %>% mutate(year = 2018)
relation_2018 <- read.csv("data/DC-2018/NIBRS_VICTIM_OFFENDER_REL.csv")
relationship_2018 <- read.csv("data/DC-2018/NIBRS_RELATIONSHIP.csv")

offense_data_2018 <- left_join(offense_data_2018,offense_2018, by = "OFFENSE_TYPE_ID")
relation_2018 <- left_join(relation_2018,relationship_2018, by = "RELATIONSHIP_ID")
victim_data_2018 <- right_join(victim_data_2018,relation_2018, by = "VICTIM_ID")
total_data_2018 <- left_join(victim_data_2018,offense_data_2018, by = c("INCIDENT_ID","year"))
total_data_2018 <- total_data_2018 %>% select(c(RELATIONSHIP_NAME,OFFENSE_CATEGORY_NAME, year))

## 2019
offense_data_2019 <- read.csv("data/DC-2019/NIBRS_OFFENSE.csv") %>% mutate(year = 2019)
offense_2019 <- read.csv("data/DC-2019/NIBRS_OFFENSE_TYPE.csv")
victim_data_2019 <- read.csv("data/DC-2019/NIBRS_VICTIM.csv") %>% mutate(year = 2019)
relation_2019 <- read.csv("data/DC-2019/NIBRS_VICTIM_OFFENDER_REL.csv")
relationship_2019 <- read.csv("data/DC-2019/NIBRS_RELATIONSHIP.csv")

offense_data_2019 <- left_join(offense_data_2019,offense_2019, by = "OFFENSE_TYPE_ID")
relation_2019 <- left_join(relation_2019,relationship_2019, by = "RELATIONSHIP_ID")
victim_data_2019 <- right_join(victim_data_2019,relation_2019, by = "VICTIM_ID")
total_data_2019 <- left_join(victim_data_2019,offense_data_2019, c("INCIDENT_ID","year"))
total_data_2019 <- total_data_2019 %>% select(c(RELATIONSHIP_NAME,OFFENSE_CATEGORY_NAME, year))

## 2020
offense_data_2020 <- read.csv("data/DC-2020/NIBRS_OFFENSE.csv") %>% mutate(year = 2020)
offense_2020 <- read.csv("data/DC-2020/NIBRS_OFFENSE_TYPE.csv")
victim_data_2020 <- read.csv("data/DC-2020/NIBRS_VICTIM.csv") %>% mutate(year = 2020)
relation_2020 <- read.csv("data/DC-2020/NIBRS_VICTIM_OFFENDER_REL.csv")
relationship_2020 <- read.csv("data/DC-2020/NIBRS_RELATIONSHIP.csv")

offense_data_2020 <- left_join(offense_data_2020,offense_2020, by = "OFFENSE_TYPE_ID")
relation_2020 <- left_join(relation_2020,relationship_2020, by = "RELATIONSHIP_ID")
victim_data_2020 <- right_join(victim_data_2020,relation_2020, by = "VICTIM_ID")
total_data_2020 <- left_join(victim_data_2020,offense_data_2020, by = c("INCIDENT_ID","year"))
total_data_2020 <- total_data_2020 %>% select(c(RELATIONSHIP_NAME,OFFENSE_CATEGORY_NAME, year))

## 2021
offense_data_2021 <- read.csv("data/DC-2021/NIBRS_OFFENSE.csv") %>% mutate(year = 2021)
offense_2021 <- read.csv("data/DC-2021/NIBRS_OFFENSE_TYPE.csv")
victim_data_2021 <- read.csv("data/DC-2021/NIBRS_VICTIM.csv") %>% mutate(year = 2021)
relation_2021 <- read.csv("data/DC-2021/NIBRS_VICTIM_OFFENDER_REL.csv")
relationship_2021 <- read.csv("data/DC-2021/NIBRS_RELATIONSHIP.csv")

offense_data_2021 <- left_join(offense_data_2021,offense_2021, by = "offense_code")
relation_2021 <- left_join(relation_2021,relationship_2021, by = "relationship_id")
victim_data_2021 <- right_join(victim_data_2021,relation_2021, by = "victim_id")
total_data_2021 <- left_join(victim_data_2021,offense_data_2021, by = c("incident_id","year"))
total_data_2021 <- total_data_2021 %>% select(c(relationship_name,offense_category_name, year))

## 2022
offense_data_2022 <- read.csv("data/DC-2022/NIBRS_OFFENSE.csv") %>% mutate(year = 2022)
offense_2022 <- read.csv("data/DC-2022/NIBRS_OFFENSE_TYPE.csv")
victim_data_2022 <- read.csv("data/DC-2022/NIBRS_VICTIM.csv") %>% mutate(year = 2022)
relation_2022 <- read.csv("data/DC-2022/NIBRS_VICTIM_OFFENDER_REL.csv")
relationship_2022 <- read.csv("data/DC-2022/NIBRS_RELATIONSHIP.csv")

offense_data_2022 <- left_join(offense_data_2022,offense_2022, by = "offense_code")
relation_2022 <- left_join(relation_2022,relationship_2022, by = "relationship_id")
victim_data_2022 <- right_join(victim_data_2022,relation_2022, by = "victim_id")
total_data_2022 <- left_join(victim_data_2022,offense_data_2022, by = c("incident_id","year"))
total_data_2022 <- total_data_2022 %>% select(c(relationship_name,offense_category_name, year))

## adjusting colnames for difference
colnames(total_data_2021) <- c("RELATIONSHIP_NAME","OFFENSE_CATEGORY_NAME", "year")
colnames(total_data_2022) <- c("RELATIONSHIP_NAME","OFFENSE_CATEGORY_NAME", "year")

## groups
total_data_relation <- rbind(total_data_2018, total_data_2019, total_data_2020, total_data_2021, total_data_2022)

## relationships store for next chunk
relationships <- total_data_relation$RELATIONSHIP_NAME %>% factor() %>% levels()

## Splitting the relationships type into indicies and then filtering by them
family_relationships_index <- c(6,14,15,16,19,21,22)
partner_relationships_index <- c(1,5,7,8,11,12,23,24,25,26)
acquaintance_relationships_index <- c(3,4,9,10,13,17,18,20)
stranger_relationships_index <- c(27)
other_relationships_index <- c(2)

family_relationships <- relationships[family_relationships_index]
partner_relationships <- relationships[partner_relationships_index]
acquaintance_relationships <- relationships[acquaintance_relationships_index]
stranger_relationships <- relationships[stranger_relationships_index]
other_relationships <- relationships[other_relationships_index]

## Function for new column of values
relation_checker <- function(value){
  if(value %in% family_relationships){
    val <- "Family"
  } else if(value %in% partner_relationships){
    val <- "Partner/Partners Family"
  } else if(value %in% acquaintance_relationships){
    val <- "Acquaintance"
  } else if(value %in% stranger_relationships){
    val <- "Stranger"
  } else{
    val <- "Other"
  }
}

## vectorizing the function and adding the colum
relation_checker <- Vectorize(relation_checker)

total_data_relation <- total_data_relation %>% mutate(Relation_group = relation_checker(RELATIONSHIP_NAME)) %>% filter(Relation_group != "Other")
#total_data_relation$Relation_group %>% table()

## Making matrix for Viz
mat <- total_data_relation %>% group_by(Relation_group,OFFENSE_CATEGORY_NAME) %>% tally() %>%
  spread(Relation_group,n) %>% as.data.frame()
mat[is.na(mat)] <- 0
rownames(mat) <- mat$OFFENSE_CATEGORY_NAME
mat <- mat %>% select(-OFFENSE_CATEGORY_NAME)

# Color palette
colors = c("#F1FAEE", "#A8DADC", "#457B9D", "#1D3557")
# colors = c("#ccdbdc","#edf8b1", "#7fcdbb", "#2c7fb8")

## Heatmap code
ptotal <- heatmaply(mat,
                    label_names = c("Crime Group", "Relation", "Relation Prevelance"),
                    colors = colors,
                    # width  = 800, 
                    height = 600,
                    dendrogram = FALSE,
                    # limits = c(0,10000),
                    scale = "row",
                    branches_lwd = 0.1,
                    # hide_colorbar = TRUE,
                    grid_color = "white",
                    grid_width = 0.00001,
                    dend_hoverinfo = FALSE,
                    main = "Heatmap of offense category by relationship between victim and offender")
ptotal

Figure 4: This heat map shows the amount, by color, of each offense category and what the relationship of the victim was to the offender. It is scaled by row so the higher value means more likely and lower value means less likely per offense category.

For the first question of the game “who committed the crime” we wanted to look at the relationship between the victim and the offender per each offense category through a heatmap. Most commonly the offenses were perpetrated by strangers but there were some which were more likely to happen by acquaintances such as embezzlement or by family for arson. The variance from offense category to offense category also shows that crime isn’t always an external force but can also come from people not traditionally associated such as employers, babysitters, or acquaintances. Despite this nuance, overall Mr. Mo Body was most likely a victim to a stranger despite quirks with some offense categories. With this answer confirmed, let’s consider the next question Clue would require.

Figure 5: Interactive Location of Offenses in 2018, 2020, 2022

Figure 5: Interactive Visualization of Top 15 Crime Incidents by Location in Washington D.C. for the Years 2018, 2020, and 2022

In order to address the subsequent question regarding location, we can leverage an interactive bar chart that presents a comprehensive breakdown of the specific sites where crimes were most frequently reported. For example, by selecting the category “All Other Larceny,” the chart vividly displays its occurrences across diverse settings such as airports, bus stations, train terminals, as well as commercial and office buildings, thereby highlighting the spatial distribution of criminal activities. This visualization effectively challenges the game’s underlying assumptions, illustrating how crime impacts different locations. The predominant locations for crimes also exhibit variations over the years, with a notable shift towards residential areas as opposed to public spaces. In a cumulative analysis across the years, Mr. Mo Body was most frequently victimized at their residence, although this likelihood varies significantly depending on the type of offense. Next, we will explore the types of weapons used in these incidents.

Figure 6: Sankey Diagram of Weapon Type and Injury by Offense

Code
library(tidyverse)
library(networkD3)
library(htmlwidgets)
library(htmltools)

# Read all necessary files
offense_18 = read.csv("./data/DC-2018/NIBRS_OFFENSE.csv") 
offender_18 = read.csv("./data/DC-2018/NIBRS_OFFENDER.csv") 
victim_18 = read.csv("./data/DC-2018/NIBRS_VICTIM.csv")
weapon_18 = read.csv("./data/DC-2018/NIBRS_WEAPON.csv") 
injury_18 = read.csv("./data/DC-2018/NIBRS_VICTIM_INJURY.csv")

# Select offense_id, incident_id, offender_id, victim_id, offense_code, injury_id, weapon_id
offense_18 = offense_18 %>% select(2,3,4)
offender_18 = offender_18 %>% select(2,3)
victim_18 = victim_18 %>% select(2,3)
weapon_18 = weapon_18 %>% select(2,3)
injury_18 = injury_18 %>% select(2,3)

# Read codes files for nodes
offense_code = read.csv("./data/DC-2018/NIBRS_OFFENSE_TYPE.csv")
injury_code = read.csv("./data/DC-2018/NIBRS_INJURY.csv")
weapon_code = read.csv("./data/DC-2018/NIBRS_WEAPON_TYPE.csv")

# Get offense_code, offense_type_id, offense_name
offense_code = offense_code %>% select(1,2,3)
# Change offense_type_id to offense_code
offense_18 = merge(offense_18, offense_code, by = "OFFENSE_TYPE_ID")
offense_18 = offense_18 %>% select(2,3,4)
# Merge by incident_id, offense_id, victim_id
df_18 = merge(merge(merge(merge(offense_18, offender_18, by = "INCIDENT_ID"), victim_18, by = "INCIDENT_ID"), injury_18, by = "VICTIM_ID"), weapon_18, by = "OFFENSE_ID")
# Remove incident_id, offense_id, victim_id, offender_id
df_18 = df_18 %>% select(-1,-2,-3,-5)

# # Make column names to lower case
colnames(df_18) = tolower(colnames(df_18))

# Paste character to make ids unique
df_18$injury_id = paste0("i", df_18$injury_id)
df_18$weapon_id = paste0("w", df_18$weapon_id)

# # Count the unique combinations of offense types and weapon types and subset if there are more than 100 cases
first_link = df_18 %>%
    group_by(offense_code, weapon_id) %>%
    summarise(value = n(), .groups = "drop") %>%
    arrange(desc(value)) %>%
    rename(source = offense_code, target = weapon_id) %>%
    filter(value > 100)
# # Count the unique combinations of weapon types and injury types and subset if there are more than 100 cases
second_link = df_18 %>%
    group_by(weapon_id, injury_id) %>%
    summarise(value = n(), .groups = "drop") %>%
    arrange(desc(value)) %>%
    rename(source = weapon_id, target = injury_id) %>%
    filter(value > 100)
# # Combine those two links
links.df = as.data.frame(rbind(first_link,second_link))

# Get the codes and names
offense_code = offense_code %>%
    select(2,3) %>%
    rename(name = OFFENSE_CODE, label = OFFENSE_NAME)
injury_code = injury_code %>% 
    select(1,3) %>%
    rename(name = INJURY_ID, label = INJURY_NAME)
weapon_code = weapon_code %>% 
    select(1,3) %>%
    rename(name = WEAPON_ID, label = WEAPON_NAME)

# Make codes unique
injury_code$name = paste0("i", injury_code$name)
weapon_code$name = paste0("w", weapon_code$name)
# Combine all the nodes
nodes.df = rbind(offense_code, injury_code, weapon_code)
# Subset only nodes from the links
nodes.df = nodes.df %>% filter(name %in% c(unique(first_link$source),unique(first_link$target),unique(second_link$target)))

# Create source_id and target_id for a sankey diagram
links.df$source_id = match(links.df$source, nodes.df$name) - 1 
links.df$target_id = match(links.df$target, nodes.df$name) - 1 

# Color Palette
my_color = 'd3.scaleOrdinal().range(["#F1FAEE", "#A8DADC", "#457B9D", "#1D3557"])'

# Create a sankey diagram
net = sankeyNetwork(Links = links.df,     
              Nodes = nodes.df,     
              Source = 'source_id', 
              Target = 'target_id', 
              Value = 'value',     
              NodeID = 'label',      
              fontSize = 16,        
              colourScale=my_color, 
              iterations = 0)

# Add a title
net_with_title = prependContent(net, tags$b(HTML('Injuries and weapon type by offense type in 2018')))
net_with_title
Injuries and weapon type by offense type in 2018
Code
library(tidyverse)
library(networkD3)
library(htmlwidgets)
library(htmltools)

# Read all necessary files
offense_22 = read.csv("./data/DC-2022/NIBRS_OFFENSE.csv") 
offender_22 = read.csv("./data/DC-2022/NIBRS_OFFENDER.csv") 
victim_22 = read.csv("./data/DC-2022/NIBRS_VICTIM.csv")
weapon_22 = read.csv("./data/DC-2022/NIBRS_WEAPON.csv") 
injury_22 = read.csv("./data/DC-2022/NIBRS_VICTIM_INJURY.csv")


# Select offense_id, incident_id, offender_id, victim_id, offense_code, injury_id, weapon_id
offense_22 = offense_22 %>% select(2,3,4)
offender_22 = offender_22 %>% select(2,3)
victim_22 = victim_22 %>% select(2,3)
weapon_22 = weapon_22 %>% select(2,3)
injury_22 = injury_22 %>% select(2,3)

# Merge by incident_id, offense_id, victim_id
df_22 = merge(merge(merge(merge(offense_22, offender_22, by = "incident_id"), victim_22, by = "incident_id"), injury_22, by = "victim_id"), weapon_22, by = "offense_id")
# Remove incident_id, offense_id, victim_id, offender_id
df_22 = df_22 %>% select(-1,-2,-3,-5)

# Paste character to make ids unique
df_22$injury_id = paste0("i", df_22$injury_id)
df_22$weapon_id = paste0("w", df_22$weapon_id)

# Count the unique combinations of offense types and weapon types and subset if there are more than 100 cases
first_link = df_22 %>%
    group_by(offense_code, weapon_id) %>%
    summarise(value = n(), .groups = "drop") %>%
    arrange(desc(value)) %>%
    rename(source = offense_code, target = weapon_id) %>%
    filter(value > 100)
# Count the unique combinations of weapon types and injury types and subset if there are more than 100 cases
second_link = df_22 %>%
    group_by(weapon_id, injury_id) %>%
    summarise(value = n(), .groups = "drop") %>%
    arrange(desc(value)) %>%
    rename(source = weapon_id, target = injury_id) %>%
    filter(value > 100)
# Combine those two links
links.df = as.data.frame(rbind(first_link,second_link))

# Read codes files for nodes
offense_code = read.csv("./data/DC-2022/NIBRS_OFFENSE_TYPE.csv")
injury_code = read.csv("./data/DC-2022/NIBRS_INJURY.csv")
weapon_code = read.csv("./data/DC-2022/NIBRS_WEAPON_TYPE.csv")

# Get the codes and names
offense_code = offense_code %>% 
    select(1,2) %>%
    rename(name = offense_code, label = offense_name)
injury_code = injury_code %>% 
    select(1,3) %>%
    rename(name = injury_id, label = injury_name)
weapon_code = weapon_code %>% 
    select(1,3) %>%
    rename(name = weapon_id, label = weapon_name)

# Make codes unique
injury_code$name = paste0("i", injury_code$name)
weapon_code$name = paste0("w", weapon_code$name)
# Combine all the nodes
nodes.df = rbind(offense_code, injury_code, weapon_code)
# Subset only nodes from the links
nodes.df = nodes.df %>% filter(name %in% c(unique(first_link$source),unique(first_link$target),unique(second_link$target)))

# Create source_id and target_id for a sankey diagram
links.df$source_id = match(links.df$source, nodes.df$name) - 1 
links.df$target_id = match(links.df$target, nodes.df$name) - 1 

# Color Palette
my_color = 'd3.scaleOrdinal().range(["#F1FAEE", "#A8DADC", "#457B9D", "#1D3557"])'

# Color groupings
# nodes.df = nodes.df %>%
#   mutate(group = ifelse(name == "13B", "a",
#                           ifelse(name == "13A", "b", 
#                                 ifelse(name == "120", "c", "g")))) %>%
#   mutate(group = ifelse(name == "w41", "e", "g")) %>%
#   mutate(group = ifelse(name == "i4", "f", "g"))

# Create a sankey diagram
net = sankeyNetwork(Links = links.df,     
              Nodes = nodes.df,     
              Source = 'source_id', 
              Target = 'target_id', 
              Value = 'value',     
              NodeID = 'label',      
              fontSize = 16,    
              colourScale=my_color,    
              iterations = 0)

# Add a title
net_with_title = prependContent(net, tags$b(HTML('Injuries and weapon type by offense type in 2022')))
net_with_title
Injuries and weapon type by offense type in 2022

Figure 6: This sankey diagram shows the offenses, the weapons used, and the amount of injury caused in 2018 and 2022. The paths between the values show the flow of this amount. The data was subsetted so that the diagrams only included connections that happened more than 100 times.

Lastly, the question comes to which weapon was used in the offense. This question doesn’t have as easy of an answer as the game; candelabras, fire pokers, and kitchen knives are not as common as the game would imply. Comparing between 2018 and 2022, the usage of handguns has emerged from patterns in the data; the handgun was not listed as an option before 2021. However, the top three offense types have been the same, despite the weapon usage being different. For example, there were more robberies with handguns than with personal weapons in 2022 compared with 2018. But when looking over all the years of the data it seems that personal objects were the most likely to be used although it doesn’t narrow down the Clue answer much. But as that was the last question, lets bring it all together.

Conclusion

Overall, based on the NIBRS data, if Mr. Mo Body was a victim of an offense, it would be by a stranger in their residence with a personal weapon. Was the answer different with the offense that you chose? Whether it was or wasn’t, we hoped to show the intersections of various aspects of criminal incidents and to give you something to think about before you play your next game of Clue.